Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text

Authors

Alexandra Antonova

Alexey Misyurev

Abstract

We describe a set of techniques that have been developed while collecting parallel texts for the Russian-English language pair and building a corpus of parallel sentences for training a statistical machine translation system. We discuss issues of verifying potential parallel texts and filtering out automatically translated documents. Finally, we evaluate the quality of the 1-million-sentence corpus, which we believe may be a useful resource for machine translation research.

Similar resources

C-3MA: Tartu-Riga-Zurich Translation Systems for WMT17

This paper describes the neural machine translation systems of the University of Latvia, University of Zurich and University of Tartu. We participated in the WMT 2017 shared task on news translation by building systems for two language pairs: English↔German and English↔Latvian. Our systems are based on an attentional encoder-decoder, using BPE subword segmentation. We experimented with backtran...

Building Bilingual Corpus based on Hybrid Approach for Myanmar-English Machine Translation

Word alignment in bilingual corpora has been an active research topic in the machine translation research community. In this paper, we describe an alignment system that aligns English-Myanmar texts at word level in parallel sentences. Aligning translated segments with source segments is essential for building parallel corpora. Since word alignment research on Myanmar and English languages ...

Acquiring Translation Equivalences of Multiword Expressions by Normalized Correlation Frequencies

In this paper, we present an algorithm for extracting translations of any given multiword expression from parallel corpora. Given a multiword expression to be translated, the method involves extracting a short list of target candidate words from parallel corpora based on scores of normalized frequency, generating possible translations and filtering out common subsequences, and selecting the top...

Building Parallel Corpora for SMT System: A Case Study of English-Manipuri

Statistical Machine Translation (SMT) systems are developed using sentence-aligned parallel corpora. The difficulty is that no parallel corpus of the required size exists for many language pairs. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, the various issues of a quality parallel corpus and a technique that extracts p...

Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora,...

Publication date: 2011
